[ENH] Add python & js client support to query on subset of IDs #4250
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
This stack of pull requests is managed by Graphite.
    self,
    collection_id: UUID,
    query_embeddings: Embeddings,
    ids: Optional[IDs] = None,
add to docstring(?)
added, thanks
    self,
    collection_id: UUID,
    query_embeddings: Embeddings,
    ids: Optional[IDs] = None,
add to telemetry capture call below? just observing the pattern
added, thanks
    self,
    collection_id: UUID,
    query_embeddings: Embeddings,
    ids: Optional[IDs] = None,
docstring(?)
    query_texts: Optional[OneOrMany[Document]] = None,
    query_images: Optional[OneOrMany[Image]] = None,
    query_uris: Optional[OneOrMany[URI]] = None,
    ids: Optional[IDs] = None,
what do you think about the OneOrMany pattern here?
not a fan, because why would someone want to filter on 1 id? it makes it a little more confusing on what it does. also the rust takes in a list of ids, so i'd prefer maintaining that.
yeah that's very reasonable
I think we should keep consistency with other APIs
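For context on the OneOrMany question, here is a minimal sketch of how a single ID could be normalized to a list on the client side, so that the backend still only ever receives a list of IDs. The `OneOrMany` alias mirrors the type name visible in the diff; `normalize_ids` is a hypothetical helper written for illustration and is not claimed to be the function this PR uses.

```python
from typing import List, Optional, TypeVar, Union

T = TypeVar("T")
# Assumption: OneOrMany means "a single value or a list of values".
OneOrMany = Union[T, List[T]]


def normalize_ids(ids: Optional[OneOrMany[str]]) -> Optional[List[str]]:
    """Hypothetical helper: wrap a lone ID in a list, pass lists through."""
    if ids is None:
        return None  # no ID filter was requested
    if isinstance(ids, str):
        return [ids]  # a single ID becomes a one-element list
    return list(ids)
```

With this shape, `normalize_ids("1")` and `normalize_ids(["1"])` produce the same list, which is the consistency argument made in the thread.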
i think i'm missing something, so for my edification, which part of the code actually processes the EDIT: Sorry that might be a really broad question, I don't know how the rust code works.
it would be cool to add filter syntax to
@jairad26 would it be helpful for me to put together a PR to add this feature to the docs?
yep that's exactly the spot! all that does is filter by ids pre-vector search.
yes that would be nice, but would only really work if the ids are ints. not 100% sure but i dont think uuids for example guarantee order like that. also random string generator definitely wouldn't work with that
I'm moving this PR to draft for now, since i want to add support for the js client and that requires a couple other PRs to get merged. after that I can prob handle the docs. Thank you though!
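To illustrate what "filter by ids pre-vector search" means, here is a rough sketch in pure Python. It is a brute-force stand-in, not the Rust implementation discussed in this thread: the function names, the dictionary-based record store, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def query_with_id_filter(
    records: Dict[str, Sequence[float]],
    query: Sequence[float],
    n_results: int,
    ids: Optional[List[str]] = None,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: restrict candidates to `ids`, then rank by distance."""
    # Pre-filter: narrow the candidate set before any distance is computed.
    if ids is None:
        candidates = dict(records)
    else:
        candidates = {i: records[i] for i in ids if i in records}
    scored = [(i, l2_distance(vec, query)) for i, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1])  # nearest first
    return scored[:n_results]


records = {"1": [0.0, 0.0], "2": [0.1, 0.0], "3": [1.0, 1.0]}
filtered = query_with_id_filter(records, [0.0, 0.0], n_results=3, ids=["1", "3"])
```

Because the filter runs first, record "2" is never scored even though it is closer to the query than "3", and asking for three results from a two-ID subset returns only two.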
    WhereDocument,
)
from chromadb.test.conftest import reset, NOT_CLUSTER_ONLY
import chromadb.test.property.strategies as strategies
please add explicit unit tests as well with small, medium size data
HammadB
left a comment
Generally looks good. Please add more explicit unit testing, especially around edge cases like deleted ids and upserted ids.
Also I think OneOrMany is preferable for API consistency, even though it is unlikely to be used.
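As a rough idea of what explicit edge-case tests for deleted and upserted IDs might look like, here is a sketch against a toy in-memory collection. `ToyCollection` is a hypothetical stand-in and not the chromadb API. The behavior it encodes (IDs that no longer exist are silently skipped rather than raising) is an assumption for illustration, not something confirmed in this thread.

```python
from typing import Dict, List, Optional


class ToyCollection:
    """In-memory stand-in for a collection; NOT the chromadb API."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def upsert(self, ids: List[str], documents: List[str]) -> None:
        for i, doc in zip(ids, documents):
            self._docs[i] = doc  # insert or overwrite

    def delete(self, ids: List[str]) -> None:
        for i in ids:
            self._docs.pop(i, None)

    def query(self, ids: Optional[List[str]] = None) -> Dict[str, str]:
        # Assumption: IDs that no longer exist are skipped, not an error.
        if ids is None:
            return dict(self._docs)
        return {i: self._docs[i] for i in ids if i in self._docs}


def test_deleted_id_is_not_returned() -> None:
    coll = ToyCollection()
    coll.upsert(["1", "2", "3"], ["test", "test2", "apple"])
    coll.delete(["3"])
    assert coll.query(ids=["1", "3"]) == {"1": "test"}


def test_upserted_id_returns_latest_value() -> None:
    coll = ToyCollection()
    coll.upsert(["1", "2"], ["test", "test2"])
    coll.upsert(["2"], ["pear"])  # overwrite an existing record
    assert coll.query(ids=["2"]) == {"2": "pear"}
```

A real version of these tests would go through the client's own add, upsert, delete, and query calls and would also assert on the returned distances.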
random_query = normalized_record_set["embeddings"][
    random.randint(0, total_count - 1)
]
# Use data.draw to select index
thanks for this change
    client.get_settings().chroma_api_impl
    == "chromadb.api.async_fastapi.AsyncFastAPI"
):
    pytest.skip(
aside - when can we remove this?
i haven’t been able to find a root cause, will set aside a couple hours for it this week
ids_to_query = data.draw(
    st.lists(
        st.sampled_from(normalized_record_set["ids"]),
        min_size=ids_subset_size,
why choose id_subset_size when min_size and max_size do the same thing? This introduces more entropy requirements.
ah yea ok, will make it create more diverse sets of data
No, my point is that I think it's the same thing. st.lists() choosing a size between min_size and max_size is equivalent to data.draw(st.integers(min_value=0, max_value=total_count))
oh got it okay. removing id_subset_size.
it's this now
ids_to_query = data.draw(
    st.lists(
        st.sampled_from(normalized_record_set["ids"]),
        min_size=0,
        max_size=total_count,
        unique=True,
    )
)
…a-core#4250)

## Description of changes

This PR adds python client and async python client support to query on a filtered set of IDs

example:
```
ids = ["1", "2", "3"]
documents = ["test", "test2", "apple"]
metadatas = [{"source": "test"}, {"source": "test2"}, {"source": "apple"}]
coll.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas,
    # embeddings=numpy_embeddings
)
output = coll.query(
    ids=["1", "3"],
    query_texts=["test"],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)
print(output)
```

This will output
% python test_filter_id.py
{'ids': [['1', '3']], 'embeddings': None, 'documents': [['test', 'apple']], 'uris': None, 'included': ['documents', 'metadatas', 'distances'], 'data': None, 'metadatas': [[{'source': 'test'}, {'source': 'apple'}]], 'distances': [[0.0, 0.7396076321601868]]}

## Test plan

*How are these changes tested?*
- [x] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust
- Added prop tests to test with other filtering and on its own

## Documentation Changes

*Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs repository](https://github.com/chroma-core/docs)?*